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Dear Jessie, 


MetaPhIAn 3 default output is made of relative abundances: therefore dividing each by 100 gives a representation of the probability of 

each bug. Keeping this in mind, you can easily compute Shannon entropy (as Shannon diversity) based on the standard formula: 

-sum( each probability * log( each probability) ), excluding the zero values. If you don't feel confident you can easily use functions in scipy or scikit-bio, or test 
yours against these. 


For what concerns rarefaction, the most straight forward method is to directly subset the fastq file of each sample down to the desired number of reads: you can 
count the number of reads in the file with a fast function: 


def rawpycount(filename): 
f = openr(filename,'rb’) 
f_gen=_make_gen(f.read) 
return sum( buf.count(b'\n’) for buf in f_gen) 


and then printing out the 4 lines fastq randomly choosen: 


def random_sample(par): 
N,n = rawpycount(par[inp_f])/4, par[nreads'] 
if N<n: exit(1) 
sample = (np.fromiter(np.random.choice(N,n,replace=False), dtype=np.int64)) 
length_sample = len(sample) 
sample = set(sample) 
length, line = 0, 1 
f, i, c= openr(par[inp_f],'r’), -1, 0 


while (line and length<length_sample): 
line = f.readline() 
if C%4==0: 
i+=1 ## read. 
if i in sample: 
length += 1 
print line.rstrip() 
for ii in range(3): 
linetokeep = f.readline() 
print linetokeep.rstrip() 
ct=3 
ct=1 


Another method is to identify a percentage of reads that corresponds to the desired number, then use the "fake coin". 
Suppose that you want to rarefy to the 40%, for each fastq you generate a random number between 0 and 1. Ifitis <= 0.4, 
you preserve the read, otherwise you discard it. This method can be applied also to the bowtie output from metaphlan, 
making you gaining time. 


matching_reads_subset =[m for m,rand in zip( bowtie_matching_reads, np.random.rand( tot_matches )) if rand<=choosen_percentage_wrt_min] 


Bye 
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